113 research outputs found

    Assessing the Potential of Classical Q-learning in General Game Playing

    Get PDF
    After the recent groundbreaking results of AlphaGo and AlphaZero, we have seen strong interests in deep reinforcement learning and artificial general intelligence (AGI) in game playing. However, deep learning is resource-intensive and the theory is not yet well developed. For small games, simple classical table-based Q-learning might still be the algorithm of choice. General Game Playing (GGP) provides a good testbed for reinforcement learning to research AGI. Q-learning is one of the canonical reinforcement learning methods, and has been used by (Banerjee &\& Stone, IJCAI 2007) in GGP. In this paper we implement Q-learning in GGP for three small-board games (Tic-Tac-Toe, Connect Four, Hex)\footnote{source code: https://github.com/wh1992v/ggp-rl}, to allow comparison to Banerjee et al.. We find that Q-learning converges to a high win rate in GGP. For the ϵ\epsilon-greedy strategy, we propose a first enhancement, the dynamic ϵ\epsilon algorithm. In addition, inspired by (Gelly &\& Silver, ICML 2007) we combine online search (Monte Carlo Search) to enhance offline learning, and propose QM-learning for GGP. Both enhancements improve the performance of classical Q-learning. In this work, GGP allows us to show, if augmented by appropriate enhancements, that classical table-based Q-learning can perform well in small games.Comment: arXiv admin note: substantial text overlap with arXiv:1802.0594

    The Impatient May Use Limited Optimism to Minimize Regret

    Full text link
    Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent is enticed to get rewards early since later rewards are discounted. When the agent interacts with the environment, she may regret her actions, realizing that a previous choice was suboptimal given the behavior of the environment. The main contribution of this paper is a PSPACE algorithm for computing the minimum possible regret of a given game. To this end, several results of independent interest are shown. (1) We identify a class of regret-minimizing and admissible strategies that first assume that the environment is collaborating, then assume it is adversarial---the precise timing of the switch is key here. (2) Disregarding the computational cost of numerical analysis, we provide an NP algorithm that checks that the regret entailed by a given time-switching strategy exceeds a given value. (3) We show that determining whether a strategy minimizes regret is decidable in PSPACE

    Answer Set Programming for Non-Stationary Markov Decision Processes

    Full text link
    Non-stationary domains, where unforeseen changes happen, present a challenge for agents to find an optimal policy for a sequential decision making problem. This work investigates a solution to this problem that combines Markov Decision Processes (MDP) and Reinforcement Learning (RL) with Answer Set Programming (ASP) in a method we call ASP(RL). In this method, Answer Set Programming is used to find the possible trajectories of an MDP, from where Reinforcement Learning is applied to learn the optimal policy of the problem. Results show that ASP(RL) is capable of efficiently finding the optimal solution of an MDP representing non-stationary domains

    Probabilistic inference for determining options in reinforcement learning

    Get PDF
    Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi Markov decision process setting (SMDP) and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks

    Game theory of mind

    Get PDF
    This paper introduces a model of ‘theory of mind’, namely, how we represent the intentions and goals of others to optimise our mutual interactions. We draw on ideas from optimum control and game theory to provide a ‘game theory of mind’. First, we consider the representations of goals in terms of value functions that are prescribed by utility or rewards. Critically, the joint value functions and ensuing behaviour are optimised recursively, under the assumption that I represent your value function, your representation of mine, your representation of my representation of yours, and so on ad infinitum. However, if we assume that the degree of recursion is bounded, then players need to estimate the opponent's degree of recursion (i.e., sophistication) to respond optimally. This induces a problem of inferring the opponent's sophistication, given behavioural exchanges. We show it is possible to deduce whether players make inferences about each other and quantify their sophistication on the basis of choices in sequential games. This rests on comparing generative models of choices with, and without, inference. Model comparison is demonstrated using simulated and real data from a ‘stag-hunt’. Finally, we note that exactly the same sophisticated behaviour can be achieved by optimising the utility function itself (through prosocial utility), producing unsophisticated but apparently altruistic agents. This may be relevant ethologically in hierarchal game theory and coevolution

    Assessing the Potential of Classical Q-learning in General Game Playing

    Get PDF
    After the recent groundbreaking results of AlphaGo and AlphaZero, we have seen strong interests in deep reinforcement learning and artificial general intelligence (AGI) in game playing. However, deep learning is resource-intensive and the theory is not yet well developed. For small games, simple classical table-based Q-learning might still be the algorithm of choice. General Game Playing (GGP) provides a good testbed for reinforcement learning to research AGI. Q-learning is one of the canonical reinforcement learning methods, and has been used by (Banerjee & Stone, IJCAI 2007) in GGP. In this paper we implement Q-learning in GGP for three small-board games (Tic-Tac-Toe, Connect Four, Hex), to allow comparison to Banerjee et al. We find that Q-learning converges to a high win rate in GGP. For the ϵ" role="presentation" style="display: inline-table; line-height: normal; letter-spacing: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border-width: 0px; border-style: initial; position: relative;">ϵ-greedy strategy, we propose a first enhancement, the dynamic ϵ" role="presentation" style="display: inline-table; line-height: normal; letter-spacing: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border-width: 0px; border-style: initial; position: relative;">ϵ algorithm. In addition, inspired by (Gelly & Silver, ICML 2007) we combine online search (Monte Carlo Search) to enhance offline learning, and propose QM-learning for GGP. Both enhancements improve the performance of classical Q-learning. In this work, GGP allows us to show, if augmented by appropriate enhancements, that classical table-based Q-learning can perform well in small games.Computer Systems, Imagery and Medi

    Learning and innovative elements of strategy adoption rules expand cooperative network topologies

    Get PDF
    Cooperation plays a key role in the evolution of complex systems. However, the level of cooperation extensively varies with the topology of agent networks in the widely used models of repeated games. Here we show that cooperation remains rather stable by applying the reinforcement learning strategy adoption rule, Q-learning on a variety of random, regular, small-word, scale-free and modular network models in repeated, multi-agent Prisoners Dilemma and Hawk-Dove games. Furthermore, we found that using the above model systems other long-term learning strategy adoption rules also promote cooperation, while introducing a low level of noise (as a model of innovation) to the strategy adoption rules makes the level of cooperation less dependent on the actual network topology. Our results demonstrate that long-term learning and random elements in the strategy adoption rules, when acting together, extend the range of network topologies enabling the development of cooperation at a wider range of costs and temptations. These results suggest that a balanced duo of learning and innovation may help to preserve cooperation during the re-organization of real-world networks, and may play a prominent role in the evolution of self-organizing, complex systems.Comment: 14 pages, 3 Figures + a Supplementary Material with 25 pages, 3 Tables, 12 Figures and 116 reference

    Reinforcement learning or active inference?

    Get PDF
    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain
    • …
    corecore